Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++

Authors

Gylfi Þór Gudmundsson

Laurent Amsaleg

Björn Þór Jónsson

Abstract

This paper describes an initial study in which the open-source Hadoop parallel and distributed run-time environment is used to speed up the construction phase of a large high-dimensional index. It first discusses the typical practical problems developers may run into when porting their code to Hadoop. It then presents early experimental results showing that the performance gains are substantial when indexing large data sets.

Related resources

High Scalability of HDFS using Distributed Namespace

In data intensive computing, Hadoop is widely used by organizations. The client applications of Hadoop require high availability and scalability of the system. Mostly, these applications are online and their data growth rate is unpredictable. The present Hadoop relies on secondary namenode for failover which slows down the performance of the system. Hadoop system’s scalability depends on the ve...

Live Website Traffic Analysis Integrated with Improved Performance for Small Files using Hadoop

Hadoop, an open source Java framework, deals with big data. It has HDFS (Hadoop Distributed File System) and MapReduce. HDFS is designed to handle large amount files through clusters and suffers a performance penalty when dealing with a large number of small files. These large numbers of small files pose a heavy burden on the NameNode of HDFS and increase the execution time for MapReduce. Secondly, ...

NameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System

Current works on power-proportional distributed file systems have not considered the cost of updating data sets that were modified (updated or appended) in a low-power mode, where a subset of nodes were powered off. Effectively reflecting the updated data is vital in making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. This paper presents a no...

Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the storage layer for Apache Hadoop ecosystem, persisting large data sets across multiple machines. However, the overall storage capacity is limited since the metadata is stored in-memory on a single server, called the NameNode. The heap size of the NameNode restricts the number of data files and addressable blocks persisted in the file system. The H...

Ceph as a scalable alternative to the Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) has a single metadata server that sets a hard limit on its maximum size. Ceph, a high-performance distributed file system under development since 2005 and now supported in Linux, bypasses the scaling limits of HDFS. We describe Ceph and its elements and provide instructions for installing a demonstration system that can be used...

Publication date: 2012
